Parallel Programming Overview#

Course: Numerical Analysis Project

Overview#

  • Semiconductor performance has improved roughly in line with Moore's law

    • Single-core performance gains, however, have been shrinking

  • Supercomputers are built as clusters of many interconnected machines

  • Parallel computation on multi-core and many-core processors

    • e.g. training AI models on GPUs

Computer Architecture#

  • Von Neumann architecture

    • Composed of a CPU, memory, storage, network interfaces, and other devices

Fig. 14 Von Neumann architecture (From Wikipedia)#

  • CPU: composed of the ALU, the control unit (CU), and caches

    • Multi-core processors

Fig. 15 Dual Core Processor (From Wikipedia)#

  • SIMD (Single Instruction, Multiple Data): vector instructions (MMX, SSE, AVX, NEON); see the NumPy sketch at the end of this section

Fig. 16 SIMD (From Wikipedia)#

  • Memory: memory speed has improved much more slowly than processor performance

    • Types: DDR, GDDR, HBM

  • Network

    • Routers, cables, and network interface cards

    • Types: Ethernet (1G, 10G), Omni-Path, InfiniBand

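From Python, the usual way to benefit from these SIMD units is NumPy's vectorized array operations, whose compiled inner loops can be auto-vectorized by the compiler. Below is a minimal sketch (not part of the lecture code; how much SIMD is actually used depends on the NumPy build and the CPU).

import numpy as np

n = 10_000_000
a = np.random.rand(n)
b = np.random.rand(n)

def add_loop(a, b):
    # Interpreted Python loop: one element at a time, no SIMD
    out = np.empty_like(a)
    for i in range(len(a)):
        out[i] = a[i] + b[i]
    return out

def add_vectorized(a, b):
    # A single compiled loop inside NumPy; the compiler can map it onto
    # SSE/AVX instructions where the hardware supports them
    return a + b

Timing the two (e.g. with %timeit) typically shows the vectorized version is orders of magnitude faster; part of the gap comes from avoiding the interpreter, part from vectorized instructions.
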
Parallel Programming Models#

Shared Memory Programming#

  • Multiple processes (or threads) compute in parallel while sharing data through a common shared memory

  • Libraries: OpenMP, POSIX threads (pthreads), Intel TBB

  • Fork and Join Model

Fig. 17 Fork and Join model (From Wikipedia)#

  • Beware of race conditions (a short sketch follows below)

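The sketch below (plain Python threading, not part of the course code) shows why: several threads increment a shared counter, and the read-modify-write on the counter is not atomic, so updates can be lost unless a lock serializes them. Whether the loss actually shows up in a given run depends on the interpreter and timing.

import threading

counter = 0
lock = threading.Lock()

def work_unsafe(n):
    global counter
    for _ in range(n):
        counter += 1          # read-modify-write on shared data: not atomic

def work_safe(n):
    global counter
    for _ in range(n):
        with lock:            # the lock serializes the update
            counter += 1

threads = [threading.Thread(target=work_unsafe, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # may be less than 400000 if updates were lost;
                # using work_safe instead always gives 400000
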
Message Passing Model#

  • Each process has its own private memory and exchanges data through explicit communication

  • Libraries: MPI implementations (MPICH, Open MPI, MS-MPI, Intel MPI)

Fig. 18 Message Passing model (From KSC)#

  • Beware of deadlocks (a short sketch follows below)

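A minimal mpi4py sketch of point-to-point communication (the script name and payload are only placeholders; run with something like mpiexec -n 2 python ping.py). Note the ordering: rank 0 sends before it receives and rank 1 receives before it sends. If both ranks called the blocking recv first, each would wait for the other forever, which is exactly the deadlock mentioned above.

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    comm.send([1.0, 2.0, 3.0], dest=1, tag=0)   # send first ...
    reply = comm.recv(source=1, tag=1)          # ... then wait for the reply
    print('rank 0 got', reply)
elif rank == 1:
    data = comm.recv(source=0, tag=0)           # matching receive
    comm.send('ack', dest=0, tag=1)
    print('rank 1 got', data)
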
Parallel Performance#

Amdahl’s law#

If a fraction \(p\) of the code is parallelized and that portion is sped up by a factor of \(N\), the overall speedup \(S\) is:

\[ S = \frac{1}{(1-p) + \frac{p}{N}} \]

Fig. 19 Speedup comparison (From Wikipedia)#

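A quick numerical illustration of the formula: even with 95 % of the code parallelized, the speedup stays below 20 no matter how large N becomes.

def amdahl_speedup(p, n):
    """Overall speedup when a fraction p of the work is sped up by a factor n."""
    return 1.0 / ((1.0 - p) + p / n)

for n in [2, 4, 8, 16, 64, 1024]:
    row = ", ".join(f"p={p:.2f}: S={amdahl_speedup(p, n):6.2f}" for p in (0.50, 0.90, 0.95))
    print(f"N = {n:4d} | {row}")
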
Parallel Programming in Python#

Numba#

  • Provides loop parallelization with prange

  • Supports multiple threading layers, e.g. OpenMP and Intel TBB

mpi4py#

  • Python bindings for MPI libraries

Example#

Parallelize the Laplace solver below using the fork-and-join model.

import numba as nb
import numpy as np

# Use OpenMP
from numba import config
config.THREADING_LAYER = 'omp'

# For Intel MKL as the BLAS and LAPACK backend
import mkl


def solve_laplace(n, solver, tol=1e-5, order='C'):
    """
    Laplace Equation solver
    
    Parameters
    ----------
    n : integer
        size
    solver : function
        iterative solver
    tol : float
        tolerance
    order : string
        'C' | 'F'
        
    Returns
    -------
    err : float
        residual
    """
    ti = np.zeros((n+2, n+2), order=order)
    dt = np.zeros((n+2, n+2), order=order)

    def bc(t):
        t[-1, 1:-1] = 300
        t[0, 1:-1] = 100
        t[1:-1, -1] = 100
        t[1:-1, 0] = 100

    err = 1
    while err > tol:
        # Apply BC
        bc(ti)

        # Run one iteration of the solver (Jacobi in this example)
        solver(n, ti, dt)

        # Compute Error
        err = np.linalg.norm(dt) / n
        
    return err


@nb.njit(fastmath=True)
def jacobi_nb(n, ti, dt):
    """
    Jacobi method
    
    Parameters
    ----------
    n : integer
        size
    ti : float
        current time
    dt : array
        difference
    """
    for i in range(1, n+1):
        for j in range(1, n+1):
            dt[i, j] = 0.25*(ti[i-1, j] + ti[i, j-1] + ti[i+1, j] + ti[i, j+1]) - ti[i, j]
            
    # Update
    ti += dt


@nb.njit(fastmath=True, parallel=True)
def jacobi_nbp(n, ti, dt):
    """
    Jacobi method
    
    Parameters
    ----------
    n : integer
        size
    ti : float
        current time
    dt : array
        difference
    """
    for i in nb.prange(1, n+1):
        for j in range(1, n+1):
            dt[i, j] = 0.25*(ti[i-1, j] + ti[i, j-1] + ti[i+1, j] + ti[i, j+1]) - ti[i, j]
            
    # Update
    for i in nb.prange(n+2):
        for j in range(n+2):
            ti[i, j] += dt[i, j]

n = 2048
%time solve_laplace(n, jacobi_nb, tol=5e-3)
CPU times: user 6min 3s, sys: 5.83 s, total: 6min 8s
Wall time: 23.4 s
0.004998731199856652
# At AMD Threadripper 5955WX (16C/32T)
for i in [1, 2, 4, 8, 16, 32]:
    # Adjust number of threads for numba and MKL
    nb.set_num_threads(i)
    mkl.set_num_threads(i)
    print("Number of Threads :", i)
    
    # Measure time
    %time solve_laplace(n, jacobi_nbp, tol=5e-3)
Number of Threads : 1
CPU times: user 27.3 s, sys: 79.4 ms, total: 27.4 s
Wall time: 24.6 s
Number of Threads : 2
CPU times: user 22.9 s, sys: 32 ms, total: 23 s
Wall time: 11.5 s
Number of Threads : 4
CPU times: user 23.8 s, sys: 60 ms, total: 23.9 s
Wall time: 5.97 s
Number of Threads : 8
CPU times: user 26.2 s, sys: 92.1 ms, total: 26.3 s
Wall time: 3.29 s
Number of Threads : 16
CPU times: user 33.3 s, sys: 192 ms, total: 33.5 s
Wall time: 2.09 s
Number of Threads : 32
CPU times: user 2min 20s, sys: 1.87 s, total: 2min 22s
Wall time: 4.48 s
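
As a rough follow-up, the wall times above can be converted into speedup and parallel efficiency (assuming the 1-thread jacobi_nbp run, 24.6 s, as the baseline):

# Speedup and efficiency from the wall times measured above
# (assumption: the 1-thread jacobi_nbp run, 24.6 s, is the baseline)
base = 24.6
wall = {1: 24.6, 2: 11.5, 4: 5.97, 8: 3.29, 16: 2.09, 32: 4.48}

for threads, t in wall.items():
    speedup = base / t
    efficiency = speedup / threads
    print(f"{threads:2d} threads: speedup {speedup:5.2f}, efficiency {efficiency:4.2f}")

Efficiency stays high up to 8 threads and drops at 16 (values slightly above 1 reflect measurement noise or cache effects). The slowdown at 32 threads is consistent with oversubscribing the 16 physical cores via SMT and hitting memory-bandwidth limits, though other causes are possible.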